1 Introduction
Obesity has become a major global health issue, with its prevalence nearly tripling since 1975. According to the WHO, more than 1.9 billion adults were overweight in 2016, of whom over 650 million were obese. In Latin America and the Caribbean, the problem is particularly pressing: as of 2022, 25% of adults were obese, with rates reaching 36.1% in Mexico and around 28% and 23% in Peru and Colombia, respectively. These trends contribute to rising cases of obesity-related diseases, such as diabetes and cardiovascular disease. This project uses data from Mexico, Peru, and Colombia (77% synthetically generated via SMOTE and 23% collected online from 498 participants) to explore how lifestyle factors contribute to obesity in these countries. While the synthetic nature of the data limits real-world applicability, it allows us to apply concepts from the “Data Science in Business Analytics” course and to identify key patterns in dietary habits and physical activity associated with obesity.
Our primary goal is to identify the most significant behavioral factors contributing to obesity in these countries by conducting exploratory analyses on lifestyle patterns and building a regression and a predictive model based on factors like diet, activity level, and demographics. Visualizations will also be developed to illustrate findings and relationships clearly, enhancing stakeholder understanding of the insights derived. Although synthetic data limits the findings’ applicability, this exercise provides valuable training in data analysis techniques and the potential insights obtainable from comprehensive, real-world data.
The main research questions are which lifestyle factors significantly affect obesity in these countries and whether obesity can be predicted from these factors. The analysis focuses on two key lifestyle elements, diet and physical activity, using data that reflects the context of Mexico, Peru, and Colombia. Through these insights, we aim to inform public health initiatives, providing actionable information for healthcare organizations and policymakers to address the growing obesity crisis effectively.
1.0.1 2. Data
The dataset used for this project, titled “Estimation of Obesity Levels Based on Eating Habits and Physical Condition,” is publicly available and was sourced from the UCI Machine Learning Repository.
This dataset, available in CSV format, was originally compiled by researchers at the Universidad de la Costa, Colombia, and includes both synthetically generated and user-collected data. About 23% of the data was collected through a web survey that was accessible online for 30 days, in which 498 individuals provided information on their dietary habits, physical activity levels, and demographics. The remaining 77% of the dataset was generated synthetically using the SMOTE algorithm (Synthetic Minority Over-sampling Technique) in Weka. SMOTE was applied to balance the dataset, addressing class imbalance by generating synthetic examples for minority classes.
The resulting dataset contains 17 attributes and 2111 records.
There are limitations and challenges associated with using this data. First, the reliance on synthetic data means that the results may not accurately represent real-world scenarios, as synthetic records may lack the nuances and variability present in genuine human behavior. Second, while the user-collected data can provide valuable insights, it may be subject to biases, such as self-reporting inaccuracies and sampling bias, which can affect the reliability of our findings. Additionally, because the survey was administered online across several countries, the respondents may not form a representative sample of the populations concerned.
Below, the steps related to the conducted analyses will be outlined.
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female 21 1.62 64.0 yes no 2 3
2 Female 21 1.52 56.0 yes no 3 3
3 Male 23 1.80 77.0 yes no 2 3
4 Male 27 1.80 87.0 no no 3 3
5 Male 22 1.78 89.8 no no 2 1
6 Male 29 1.62 53.0 no yes 2 3
CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
1 Sometimes no 2 no 0 1 no Public_Transportation
2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
3 Sometimes no 2 no 2 1 Frequently Public_Transportation
4 Sometimes no 2 no 2 0 Frequently Walking
5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
6 Sometimes no 2 no 0 0 Sometimes Automobile
NObeyesdad
1 Normal_Weight
2 Normal_Weight
3 Normal_Weight
4 Overweight_Level_I
5 Overweight_Level_II
6 Normal_Weight
Load required libraries for data manipulation, visualization, and clustering. Each package serves a specific purpose:
dplyr: For data manipulation (e.g., filtering, summarizing).
tidyr: For data tidying (e.g., reshaping).
ggplot2: For visualization.
corrplot: For correlation matrix visualization.
ggridges: For creating ridge plots.
cluster: For clustering algorithms.
reshape2: For data reshaping, especially during visualization.
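As a minimal sketch, the packages listed above could be loaded as follows. The availability check is an addition for illustration and is not part of the original report; it assumes the packages were installed beforehand (for example with `install.packages()`).

```r
# Packages named in the list above
pkgs <- c("dplyr", "tidyr", "ggplot2", "corrplot", "ggridges", "cluster", "reshape2")

# Illustrative check (not in the original report): which packages are installed?
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach every package that is available
for (pkg in pkgs[is_installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

Reporting missing packages up front avoids a failure halfway through rendering.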
Missing values are identified by counting NA values for each column. All columns contain complete data, with no missing values. If missing data were present, we could address it by either removing rows with missing values using dataset <- na.omit(dataset_row) or imputing missing values with appropriate measures (e.g. mean or median).
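The check and the two remedies described above can be sketched on a small made-up data frame. The data frame `toy` and its values are hypothetical and only illustrate the technique; they are not taken from the dataset.

```r
# Hypothetical toy data with two missing values (not from the real dataset)
toy <- data.frame(
  age    = c(21, NA, 23, 27),
  weight = c(64, 56, NA, 87),
  gender = c("Female", "Female", "Male", "Male")
)

# Count NA values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Option 1: drop every row that contains at least one NA
toy_complete <- na.omit(toy)

# Option 2: impute a numeric column with its median
toy_imputed <- toy
toy_imputed$age[is.na(toy_imputed$age)] <- median(toy$age, na.rm = TRUE)
```

Dropping rows is simpler but discards information, whereas imputation keeps every row at the cost of introducing estimated values.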
Check the structure of the dataset to identify data types for each variable. This helps in identifying columns that need to be converted or standardized.
We convert specific columns to factors for categorical interpretation during analysis. Factors ensure proper handling of discrete variables in statistical modeling.
We arranged the levels of the obesity categories, food consumption between meals, and the frequency of alcohol use to follow a logical ordinal progression, ensuring these variables accurately reflect increasing severity or frequency for improved interpretability and analysis.
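A minimal sketch of this ordering step is shown below on made-up rows. The mapping from the raw columns (`NObeyesdad`, `CAEC`) to the names `obesity_lev` and `food_btw_meals` is an assumption inferred from the variable names used later in this report, and the level order shown is one plausible ordinal progression.

```r
# Hypothetical rows using the category labels of the raw data
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Obesity_Type_I", "Insufficient_Weight", "Overweight_Level_II"),
  CAEC       = c("Sometimes", "no", "Always", "Frequently"),
  stringsAsFactors = FALSE
)

# Assumed ordinal progression of the obesity categories
obesity_levels <- c(
  "Insufficient_Weight", "Normal_Weight",
  "Overweight_Level_I", "Overweight_Level_II",
  "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"
)

# Ordered factors keep the logical progression in plots and models
toy$obesity_lev <- factor(toy$NObeyesdad, levels = obesity_levels, ordered = TRUE)
toy$food_btw_meals <- factor(
  toy$CAEC,
  levels = c("no", "Sometimes", "Frequently", "Always"),
  ordered = TRUE
)

# Integer codes now follow increasing severity
as.integer(toy$obesity_lev)
```

With ordered factors, comparisons such as `toy$obesity_lev > "Normal_Weight"` become meaningful, and plot axes follow the intended order rather than alphabetical order.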
Check the number of rows after removing duplicates.
Code
nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE
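The deduplication step itself is not shown in the report. One way to do it, sketched here in base R on hypothetical rows, is to keep only the first occurrence of each fully identical row. The drop from 2111 records to 2087 rows is consistent with 24 rows being removed, assuming no other filtering took place.

```r
# Hypothetical toy data in which the second row duplicates the first
toy <- data.frame(
  age    = c(21, 21, 23),
  height = c(1.62, 1.62, 1.80),
  weight = c(64, 64, 77)
)

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

nrow(toy_unique)
any(duplicated(toy_unique))
```

With dplyr loaded, `distinct(toy)` would give the same result.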
1.0.1.2 2.2 In-depth analysis of SMOTE’s impact and visualization of class distribution
Code
ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the x-axis text for clarity
After applying SMOTE, the distribution is noticeably more balanced across all categories, with each class showing a similar count. This outcome reflects SMOTE’s intended effect of addressing class imbalance.
1.0.1.3 2.3 Distribution analysis
Density plot for age.
Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level"
  ) +
  xlim(14, 50) # Limit the x-axis to ages 14–50
This graph allows us to assess the age distribution across obesity levels and to evaluate the impact of the SMOTE algorithm in generating synthetic data. Two key takeaways emerge: first, the distributions show a clear separation between obesity categories, particularly with younger ages dominating in lower obesity levels (e.g., Insufficient Weight and Normal Weight) and older ages appearing more prominently in higher obesity levels (e.g., Obesity Type II and III). Second, sharp peaks, such as the one around age 30 in “Obesity Type I,” could signal potential artifacts from data synthesis. While these patterns indicate that the dataset maintains logical trends, further validation is necessary to confirm that these separations and peaks reflect realistic population characteristics and not artificial biases introduced during data augmentation. Overall, the dataset appears well-structured, but these observations warrant careful consideration during analysis.
The summary statistics show relatively consistent means and standard deviations for Age, Height, and Weight across obesity levels, which suggests that SMOTE has preserved the overall distribution without introducing extreme values. Interpretation: Since the means and standard deviations are similar across classes, it appears SMOTE didn’t drastically alter the dataset’s variability. This consistency supports the idea that SMOTE effectively balanced the classes without distorting key variable distributions.
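The report does not show how these summary statistics were computed. A minimal base R sketch on made-up values is given below; the data frame `toy` and its numbers are hypothetical, and `aggregate()` is just one of several ways to obtain per-class means and standard deviations.

```r
# Hypothetical toy data: two obesity levels with three observations each
toy <- data.frame(
  obesity_lev = rep(c("Normal_Weight", "Obesity_Type_I"), each = 3),
  age         = c(20, 22, 24, 30, 32, 34),
  weight      = c(60, 62, 64, 95, 100, 105)
)

# Mean and standard deviation of each numeric variable per obesity level
stats <- aggregate(
  cbind(age, weight) ~ obesity_lev,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

print(stats)
```

Comparing the `sd` columns across classes is a quick way to spot groups whose spread looks unusually small, which can be a sign of heavy synthetic oversampling.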
Perform K-means clustering and calculate silhouette score.
Silhouette Score from K-means Clustering: The mean silhouette score of approximately 0.456 suggests a moderate level of cohesion within clusters and some separation between them. This score indicates that the clusters (representing obesity levels) are neither too distinct nor too blended. Interpretation: A score close to 0.5 generally reflects reasonable class separability without excessive artificial separability. This score suggests that SMOTE has helped create distinguishable but not overly isolated clusters, which is desirable for class balance. We conclude that SMOTE has balanced the dataset without drastically distorting it.
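The clustering code is not shown in the report. The sketch below illustrates the general technique on simulated data, assuming scaled numeric features and the `cluster` package. The number of clusters (`centers = 2`), the seed, and the simulated points are illustrative choices, not the settings used for the reported score.

```r
library(cluster)

# Simulated data: two well-separated groups of 30 points each (illustrative only)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)

# Scale features so that no variable dominates the distance computation
toy_scaled <- scale(toy)

# K-means with several random starts for a more stable solution
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Silhouette widths from cluster labels and pairwise distances
sil <- silhouette(km$cluster, dist(toy_scaled))
mean_sil <- mean(sil[, "sil_width"])
mean_sil
```

Silhouette values range from -1 to 1: values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.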
The results of the test confirmed that there are no NA values in the dataset, indicating that all variables were successfully converted to numeric format while retaining their integrity.
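A conversion and check of this kind could look like the following sketch. The toy data frame and the conversion rule (factor level codes via `as.numeric()`) are assumptions for illustration; the report does not show how `dataset_num` was actually built.

```r
# Hypothetical toy data mixing factors and numeric columns
toy <- data.frame(
  gender = factor(c("Female", "Male", "Male")),
  smoke  = factor(c("no", "yes", "no")),
  age    = c(21, 23, 27)
)

# Convert factors to their integer level codes; leave numeric columns unchanged
toy_num <- as.data.frame(
  lapply(toy, function(col) if (is.factor(col)) as.numeric(col) else col)
)

# Verify that every column is numeric and that no NA was introduced
all_numeric <- all(vapply(toy_num, is.numeric, logical(1)))
has_na <- anyNA(toy_num)
c(all_numeric = all_numeric, has_na = has_na)
```

Note that `as.numeric()` on a factor returns level codes, so the resulting numbers depend on the level order defined earlier.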
1.0.1.4 2.4 Correlations
To select the factors that may influence obesity level, we computed a correlation matrix to analyze the relationships between numeric variables, focusing on their associations with obesity_lev. Variables were reordered by the strength of their correlation with obesity_lev for clarity. A heatmap was generated using a diverging color gradient to visualize these correlations, with red indicating strong positive relationships, blue indicating negative relationships, and white indicating weak or neutral ones. Numerical labels and rotated axis labels were added to improve interpretability, highlighting key factors linked to obesity levels.
Code
# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(
  dataset_num %>%
    select(
      "physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "height",
      "family_hist", "caloric_food", "vegetable_food", "food_btw_meals",
      "use_tech", "ch2o", "m_trans", "smoke", "nb_meal_day", "calorie_check", "gender"
    ),
  use = "complete.obs"
)

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev", ]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
    axis.text.y = element_text(angle = 45, vjust = 1)  # Rotate y-axis labels for readability
  )
Code
# Create the heatmap with correlation values (reduced set of variables)
# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(
  dataset_num %>%
    select(
      "physical_act", "freq_alcohol", "obesity_lev", "age", "weight",
      "family_hist", "caloric_food", "vegetable_food", "food_btw_meals",
      "use_tech", "ch2o", "height", "calorie_check", "gender"
    ),
  use = "complete.obs"
)

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev", ]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
    axis.text.y = element_text(angle = 45, vjust = 1)  # Rotate y-axis labels for readability
  )
2 3. Exploratory Data Analysis (EDA)
2.0.0.1 3.1 Descriptive statistics and distribution analysis
2.0.0.1.1 Age
Descriptive statistics for Age.
Code
summary(dataset$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.00 19.92 22.85 24.35 26.00 61.00
Code
sd(dataset$age, na.rm =TRUE)
[1] 6.368801
Age distribution
The age data shows a right-skewed distribution, with a mean of 24.35 years and a median of 22.85 years. The range (14 to 61 years) covers a wide age span, but most individuals are concentrated in the 20–30 age range. The standard deviation (6.37 years) suggests moderate variability in the dataset. This young population distribution may limit the applicability of results to older age groups, where obesity risk factors could differ.
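The right skew can be checked numerically as well as visually. The sketch below uses a made-up vector of ages and a simple moment-based skewness formula; both are illustrative assumptions rather than part of the original analysis.

```r
# Hypothetical ages with a long right tail (not the real data)
ages <- c(18, 19, 20, 21, 21, 22, 23, 24, 26, 30, 41, 55)

# In a right-skewed sample the mean is pulled above the median
age_mean <- mean(ages)
age_median <- median(ages)

# Moment-based skewness: positive values indicate a longer right tail
centered <- ages - age_mean
skewness <- mean(centered^3) / mean(centered^2)^1.5

c(mean = age_mean, median = age_median, skewness = skewness)
```

A mean above the median together with positive skewness matches the pattern reported for age, where the mean (24.35) exceeds the median (22.85).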
Age Distribution by Obesity Level (Violin Plot)
The age distribution varies across obesity levels, highlighting distinct trends. Insufficient and normal weight categories are concentrated among younger individuals (14–30), while overweight and obesity levels shift towards mid-adulthood (20–40), peaking around 30–35 years. Severe obesity (Type III) is rare in younger ages and more common in the 30–40 range. These patterns suggest that weight issues progress with age and emphasize the need for targeted interventions during early to mid-adulthood to prevent worsening obesity levels.
Code
ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The violin plot shows, more clearly, how individuals in the lower obesity categories, such as insufficient and normal weight, are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels exhibit a broader age range, with a peak density observed around 30–40 years, particularly in Obesity Type I and Type II. Severe obesity (Type III) is rare in younger individuals and becomes more prominent in the mid-adulthood age group. This visualization underscores the gradual progression of obesity risk with age and emphasizes the critical need for early intervention strategies to address weight-related health issues, particularly during early and mid-adulthood when such risks become more pronounced.
Obesity Level by Age with a Smoothed (LOESS) Trend Line.
Code
ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()
The graph shows a smooth trend line capturing the overall pattern. Obesity levels increase significantly from adolescence to early adulthood, peaking around the 25–30 years age range. This period potentially represents a critical transition, where lifestyle factors such as reduced physical activity, higher caloric intake, and metabolic changes can contribute to the steep rise in obesity levels.
Beyond the peak, the trend shows a gradual decline in obesity levels after 30 years, which may reflect behavioral changes, such as increased health awareness, dietary improvements, or a selection bias in older age groups. This switch suggests that mid-20s to early-30s is a pivotal stage for interventions aimed at mitigating obesity risk.
2.0.0.1.2 Height
Descriptive statistics for Height.
Code
summary(dataset$height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 1.630 1.702 1.703 1.769 1.980
Code
sd(dataset$height, na.rm =TRUE)
[1] 0.09318594
Height distribution.
Code
ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()
The histogram shows that height (in meters) is approximately normally distributed, with at most a slight right skew. All values fall between 1.45 m and 1.98 m, with a peak around 1.8 m. The range is realistic, with no visible extreme outliers, and the standard deviation (0.09 m) indicates low variability. The mean and median are both approximately 1.70 m, confirming a nearly symmetrical distribution.
Height by Obesity Level
Violin Plot of Height by Obesity Level.
Code
ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
The plot shows relatively low variability in height within each category, with overlapping ranges between most groups. Individuals with Insufficient Weight and Normal Weight have slightly narrower distributions, centered around similar heights (~1.7 m). As obesity levels increase (e.g., Obesity Type I–III), the distributions remain consistent, suggesting that height is not strongly associated with obesity classification and that weight is more influential than height alone in determining obesity level.
2.0.0.1.3 Weight
Descriptive statistics for Weight.
Code
summary(dataset$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
39.00 66.00 83.10 86.86 108.02 173.00
Code
sd(dataset$weight, na.rm =TRUE)
[1] 26.19085
Weight by gender
Density plot for weight distribution by gender.
Code
ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
The density plot reveals distinct weight distributions between genders. Females generally weigh less, with a peak around 70 kg, while males peak around 85 and 115 kg, indicating a tendency toward higher weights. The overlapping region around 80–90 kg shows weights common to both genders, but the distinct density peaks emphasize gender-based differences in weight distribution. Overall, males dominate at higher ranges. Weight ranges from 39 to 173 kg, with a mean of 86.86 kg. The median weight is 83.1 kg, and the standard deviation of 26.19 kg indicates a moderate spread.
Weight by obesity level
Ridgeline Plot of Weight by Obesity Level.
Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")
This ridgeline plot shows a clear progression in weight distribution across obesity levels. As the obesity level increases, the weight distribution shifts progressively to higher ranges. The “Normal Weight” and “Insufficient Weight” categories are concentrated at lower weights, while higher obesity types (I, II, and III) peak at significantly greater weights, indicating a strong positive association between weight and obesity level. Overall, weight has a mean of 86.86 kg and a standard deviation of 26.19 kg.
2.0.0.1.4 Height and Weight
Scatter Plot (height vs weight), colored by obesity level.
Code
ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) + # Adds a trend line for each obesity level
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")
Facet Grid for Height and Weight by Obesity Level.
Code
ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")
The scatter plot with trend lines for each obesity level reveals a clear positive correlation between weight and height across all obesity levels. As the obesity level increases, the slope generally becomes steeper, indicating a stronger weight gain relative to height. We created the facet grid to show these trends more clearly for each category. The “Obesity_Type_III” (yellow) category has the steepest slope, suggesting a significant weight increase per unit of height, which is consistent with the highest level of obesity.
Correlation between height and weight.
Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468
The correlation observed between height and weight (r = 0.457) aligns with existing literature, confirming the expected positive relationship between these variables.
2.0.0.1.5 Food between meals
Code
# Dodged Bar Chart for food_btw_meals by obesity levels
ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
  labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14))
Code
# Stacked Bar Chart of Food Between Meals by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
  geom_bar(position = "fill") + # Stacked bar chart with proportions
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
  ggtitle("Proportion of Food Between Meals Across Obesity Levels") + # Shortened and clear title
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") + # Correct axis and legend labels
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis text for readability
    plot.title = element_text(hjust = 0.5, size = 14)  # Center and style the title
  )
These charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.
2.0.0.1.6 High-caloric food consumption
Code
# Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels
ggplot(dataset, aes(x = caloric_food, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels") +
  labs(x = "High-Caloric Food Consumption", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title
The dodged bar chart clearly shows that the majority of individuals, especially in the higher obesity categories (Obesity Type I–III), report consuming high-caloric foods (“yes”). This trend becomes increasingly pronounced as obesity levels rise, with very few individuals reporting “no” consumption in these categories. In contrast, lower obesity levels (e.g., Normal Weight, Overweight Level I) show a slightly higher representation of “no” responses, indicating a potential shift in dietary habits across obesity levels.
Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Proportions within each Obesity Level)
# after_stat() replaces the dot-dot notation (..count..), deprecated since ggplot2 3.4.0
ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  geom_bar(
    position = "dodge",
    aes(y = after_stat(count / tapply(count, x, sum)[x])), # Share of each response within its obesity level
    color = "black"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )
The grouped bar chart effectively shows the behavioral shift toward higher high-caloric food consumption as obesity levels increase. High-caloric food consumption (“yes”) consistently accounts for over 75% of responses, becoming nearly universal in higher obesity categories (Obesity Type I–III). In contrast, “no” responses are more visible in lower obesity levels, such as Insufficient Weight and Normal Weight, but remain a minority.
More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, which may directly contribute to weight gain, highlighting the need for dietary interventions focused on reducing high-calorie intake.
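Shares such as the one quoted above can be computed with `table()` and `prop.table()`. The sketch below uses hypothetical responses, so its percentages do not reproduce the 88.4% figure; it only illustrates the difference between an overall share and a share within each obesity level.

```r
# Hypothetical responses (not the real data)
toy <- data.frame(
  obesity_lev  = rep(c("Normal_Weight", "Obesity_Type_I"), each = 4),
  caloric_food = c("yes", "yes", "no", "no", "yes", "yes", "yes", "no")
)

# Overall share of each response across all participants
overall_share <- prop.table(table(toy$caloric_food))

# Share of each response within each obesity level (rows sum to 1)
within_level <- prop.table(table(toy$obesity_lev, toy$caloric_food), margin = 1)

round(100 * overall_share, 1)
round(100 * within_level, 1)
```

The overall share answers "how common is the behavior in the sample", while the within-level share is what the grouped bar chart displays.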
2.0.0.1.7 Alcohol consumption
Frequency of alcohol consumption.
Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")
ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
  labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title
The chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.
Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) + # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14), # Center and style title
    legend.position = "top"
  )
The proportion of individuals who drink alcohol “Sometimes” increases with higher obesity levels, peaking in Obesity_Type_III. In contrast, the likelihood of abstaining from alcohol (“no”) decreases as obesity levels rise. This pattern suggests that moderate alcohol consumption may be associated with higher obesity levels, while abstention is more common among those with lower obesity levels.
A possible interaction to investigate later is between alcohol frequency and caloric food preference, as both behaviors seem linked to higher obesity levels. Exploring this could reveal if individuals with a preference for caloric foods and moderate alcohol consumption have a compounding effect on obesity risk. This investigation could help clarify whether combined lifestyle factors contribute more significantly to higher obesity levels than each factor alone.
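One simple way to start exploring such an interaction is to compare the average obesity level across combinations of the two behaviors. The sketch below uses made-up rows and treats the ordinal obesity level as a numeric code (`obesity_code`, a hypothetical column); this is a simplifying assumption, not the method used in this report.

```r
# Hypothetical toy data: obesity level coded 1 (lowest) to 7 (highest)
toy <- data.frame(
  freq_alcohol = c("No", "No", "Sometimes", "Sometimes", "No", "Sometimes"),
  caloric_food = c("no", "yes", "no", "yes", "yes", "yes"),
  obesity_code = c(1, 3, 2, 6, 4, 7)
)

# Number of observations in each combination of the two behaviors
cell_counts <- table(toy$freq_alcohol, toy$caloric_food)

# Mean obesity code in each combination
cell_means <- tapply(
  toy$obesity_code,
  list(alcohol = toy$freq_alcohol, caloric = toy$caloric_food),
  mean
)

cell_counts
cell_means
```

If the mean for the combined behaviors is higher than either behavior alone would suggest, that would motivate adding an interaction term to a later model. Cell counts should be checked first, since sparse cells make such means unreliable.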
Monitoring of daily calorie intake.
Code
# Dodged Bar Chart for calorie_check by Obesity Levels
ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
  geom_bar(position = "dodge", color = "black") +
  ggtitle("Dodged Bar Chart for Calorie Monitoring by Obesity Levels") +
  labs(x = "Calorie Monitoring", y = "Count", fill = "Obesity Levels") + # x-axis shows calorie_check, not caloric food
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title
Code
# Note: the grouping is dropped before mutate(), so 'proportion' is each cell's share of the whole dataset
data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(total = sum(count), proportion = count / total)

# linewidth replaces the size aesthetic for lines, deprecated since ggplot2 3.4.0
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The second graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.
2.0.0.1.8 Vegetable consumption
Code
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")
Code
ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", x = "Weight", y = "Vegetable Food Consumption") +
  theme_minimal() +
  coord_cartesian(xlim = c(40, 135), ylim = c(2, 3))
The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.
This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.
The trend reveals two possible key insights:
A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
2.0.0.1.9 Physical activity
Plot histogram and density.
Code
ggplot(dataset, aes(x = physical_act)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")
The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these values are common reported levels. Intermediate values, likely due to synthetic data or SMOTE, are also present but less frequent.
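A rough way to gauge how many values fall between the reported levels is to count the non-integer entries. The sketch below uses an invented vector in place of `dataset$physical_act`. It is only a heuristic: non-integer values point to interpolation, but interpolated rows can also land on whole numbers.

```r
# Toy vector standing in for dataset$physical_act (invented values)
physical_act <- c(0, 1, 2, 3, 0.52, 1.87, 2, 0, 1, 2.31)

# Treat a value as "intermediate" if it is not a whole number
is_intermediate <- physical_act %% 1 != 0

# Share of intermediate values in the vector
share_intermediate <- mean(is_intermediate)
print(share_intermediate)
```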
Violin plot by category.
Code
# Replace 'obesity_lev' with any category variable
ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  theme_minimal() +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.
The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.
2.0.0.1.10 Water consumption
Plot histogram and density for water consumption.
Code
ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Consumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")
This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This is in line with the commonly cited guideline of roughly 2 liters of water per day.
Scatterplot of weight vs. water consumption.
Code
# Scatterplot with a LOESS trend line
ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Scatterplot of Weight vs Water Consumption", x = "Weight", y = "Water Consumption (ch2o)") +
  theme_minimal() +
  coord_cartesian(xlim = c(35, 135))
The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.
This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.
2.0.0.1.11 Technology utilization
Histogram with Density.
Code
ggplot(dataset, aes(x = use_tech)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black", alpha = 0.6) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Histogram and Density of Use of Technology", x = "Use of Technology", y = "Density") +
  theme_minimal()
The histogram shows a strong concentration at discrete values (0, 1, and 2), likely reflecting the original categorical nature of the data before SMOTE. This also aligns with observed density peaks.
Density of Use of Technology by Obesity Level.
Code
ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()
The density plot reveals distinct peaks in technology usage across obesity levels, with some levels like Obesity_Type_III having higher peaks, suggesting varied levels of technology use in these categories.
Boxplot by Obesity Level.
Code
ggplot(dataset, aes(x = obesity_lev, y = use_tech, fill = obesity_lev)) +
  geom_boxplot() +
  labs(title = "Boxplot of Use of Technology by Obesity Level", x = "Obesity Level", y = "Use of Technology") +
  theme_minimal() +
  theme(legend.position = "none")
Technology use shows some variation across obesity levels, with certain categories (like Obesity_Type_I and Obesity_Type_III) showing higher median usage compared to others.
Scatter Plot with Age.
Code
ggplot(dataset, aes(x = age, y = use_tech)) +
  geom_point(alpha = 0.4, color = "blue") +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  labs(title = "Scatter Plot of Use of Technology vs Age", x = "Age", y = "Use of Technology") +
  theme_minimal()
There is a noticeable negative correlation between age and technology use, indicating younger individuals tend to use technology more than older ones.
The ‘Use of Technology’ variable shows a clear trend where younger individuals use technology more frequently, as seen in the negative correlation with age. Differences in technology usage across obesity levels suggest it might have predictive value in distinguishing between levels, although the SMOTE algorithm has introduced interpolated values that blur strict categories. This variable could therefore help predict obesity levels, especially if technology usage patterns are indicative of lifestyle factors associated with obesity.
2.0.1 4. Analysis overview of statistical methods and model selection
In this analysis, we plan to employ a regression model to examine the relationship between a set of independent variables (the predictors) and Obesity Level, the single dependent variable (the outcome).
The rationale behind the selection of regression modeling stems from its established robustness as a statistical methodology, particularly adept at unraveling and quantifying the interrelations among variables. This is paramount, considering our overarching objective to forecast outcomes and to meticulously evaluate the repercussions that alterations in the predictor variables may have on the target variable.
Based on our exploratory data analysis, indications of potential outliers emerged within our dataset. However, upon closer examination, these values represent extreme data points that remain plausible given the context of our study. Consequently, our approach involves building two regression models: one that includes these extreme values and one that excludes them. The objective is to examine the impact of these extreme data points on the predictive performance of the model, analyzing how their presence or absence influences the resulting predictions and model behavior.
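The comparison described above could look roughly like the following sketch. It runs on simulated data, uses `weight` as a numeric stand-in for the outcome, and flags extreme values with the 1.5 × IQR rule. Both the outcome and the flagging rule are assumptions made for illustration, not choices fixed by our analysis.

```r
# Simulated stand-in for the real data (all values invented for illustration)
set.seed(1)
n <- 200
toy <- data.frame(physical_act = runif(n, 0, 3), age = runif(n, 18, 60))
toy$weight <- 70 - 4 * toy$physical_act + 0.3 * toy$age + rnorm(n, sd = 8)
toy$weight[1:5] <- toy$weight[1:5] + 100   # inject a few extreme values

# Flag extremes with the 1.5 * IQR rule (an assumed criterion)
q <- quantile(toy$weight, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
is_extreme <- toy$weight < q[1] - fence | toy$weight > q[2] + fence

# Same model specification, fitted with and without the flagged rows
fit_all     <- lm(weight ~ physical_act + age, data = toy)
fit_trimmed <- lm(weight ~ physical_act + age, data = toy[!is_extreme, ])

# Compare coefficients and R-squared side by side
print(cbind(all = coef(fit_all), trimmed = coef(fit_trimmed)))
print(c(r2_all = summary(fit_all)$r.squared,
        r2_trimmed = summary(fit_trimmed)$r.squared))
```

Large shifts in the coefficients between the two fits would indicate that the extreme values carry a lot of influence.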
The idea behind adopting regression analysis is twofold. First, it affords a nuanced understanding of the extent to which each predictor influences the outcome. Second, it provides statistical metrics for evaluating how much of the variance in the data the model explains. Through regression analysis, we can test for statistically significant associations between the variables and quantify their magnitude and direction. The method yields coefficients that reflect the expected change in the dependent variable for a one-unit change in a predictor, holding the other variables constant. That answers the first part of our research question. On top of that, the regression model will let us determine how much of the variability in the dependent variable our independent variables account for (Assess Predictive Power), and also delineate the individual impact of each predictor and validate the statistical significance of these effects (Quantify Effects).
Finally, by integrating pertinent covariates and control variables into the model, we aim to reduce bias and isolate the influence of the primary predictors on the outcome, thereby enhancing the accuracy of our findings. By looking at R², the p-values, and the standardized coefficients, we should be able to understand which key factors influence a person's weight condition (obesity level).
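As an illustration of how standardized coefficients, p-values and R² can be read off a fitted model, the sketch below standardizes all variables before fitting. The data are simulated and the numeric outcome `bmi` is a hypothetical stand-in, so the numbers carry no meaning for our dataset.

```r
# Simulated stand-in for the real data (all values invented for illustration)
set.seed(42)
n <- 150
toy <- data.frame(
  physical_act   = runif(n, 0, 3),
  vegetable_food = runif(n, 1, 3),
  age            = runif(n, 18, 60)
)
# Hypothetical numeric outcome
toy$bmi <- 30 - 2 * toy$physical_act - 1 * toy$vegetable_food +
  0.1 * toy$age + rnorm(n, sd = 3)

# Standardize outcome and predictors so coefficient sizes are comparable
toy_std <- as.data.frame(scale(toy))
fit_std <- lm(bmi ~ physical_act + vegetable_food + age, data = toy_std)

# Standardized coefficients with standard errors and p-values
print(round(summary(fit_std)$coefficients, 3))

# Share of variance explained
r2 <- summary(fit_std)$r.squared
print(r2)
```

Because every variable is on the same scale, the absolute size of each coefficient gives a rough ranking of the predictors' relative importance.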
To ensure the model is valid, we will check the linearity of the relationships between predictors and outcome, the normality of residuals, and the homogeneity of variance. Lastly, we will compute the Variance Inflation Factor (VIF), a statistical measure used to detect multicollinearity in a regression model. Multicollinearity occurs when two or more independent variables in a model are highly correlated, meaning they contain redundant information. High multicollinearity can distort the estimates of coefficients, making it difficult to interpret the individual effect of each predictor.
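The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data. The helper `vif_manual()` and the variables `x1`, `x2`, `x3` are made up for illustration; in practice a packaged implementation such as `car::vif()` could be used instead.

```r
# Simulated predictors; x2 is deliberately built to correlate with x1
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x3 = rnorm(n))
toy$x2 <- 0.9 * toy$x1 + rnorm(n, sd = 0.3)
toy$y  <- 1 + 2 * toy$x1 - toy$x3 + rnorm(n)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(vifs, 2))

# Normality of residuals, checked with a Shapiro-Wilk test
fit <- lm(y ~ x1 + x2 + x3, data = toy)
print(shapiro.test(residuals(fit)))
```

Here `x1` and `x2` receive clearly higher VIF values than `x3`, reflecting the redundancy that was built into them.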
We also want to build a predictive model. The EDA and the regression model will likely show that some key factors in our dataset are useful for predicting a person's weight category. Once we have identified relationships within our data, we aim to make reliable predictions about future outcomes. The regression will also help us understand which variables have the most significant impact on obesity level. Using further metrics, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R², we will assess how well the model performs and refine it as needed to improve accuracy.
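The three metrics can be computed on a held-out test set as in the sketch below. The data are simulated, `weight` serves as an assumed numeric outcome, and the 80/20 split is an arbitrary choice made for illustration.

```r
# Simulated stand-in for the real data (all values invented for illustration)
set.seed(123)
n <- 300
toy <- data.frame(physical_act = runif(n, 0, 3), age = runif(n, 18, 60))
toy$weight <- 70 - 4 * toy$physical_act + 0.3 * toy$age + rnorm(n, sd = 8)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(weight ~ physical_act + age, data = train)
pred <- predict(fit, newdata = test)

# Error metrics on the test set
err  <- test$weight - pred
mae  <- mean(abs(err))
rmse <- sqrt(mean(err^2))
r2   <- 1 - sum(err^2) / sum((test$weight - mean(test$weight))^2)
print(c(MAE = mae, RMSE = rmse, R2 = r2))
```

Evaluating on rows the model has not seen gives a more honest picture of predictive performance than the in-sample R².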
2.0.2 5. Conclusion
So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.
The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model—one that includes extreme values and one that excludes them—to evaluate the impact of outliers on model accuracy and stability. Key metrics such as R², P-values, and VIF will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.
These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.
Source Code
---
title: "Project Update Report (Group G): Code and Structure"
author:
  - Dorofieiev, Illia
  - Pizzi, Alessandro
  - Lovato, Andrea
  - El Abed, Ayman
institute: University of Lausanne
date: today
title-block-banner: "#0095C8" # chosen for the university of lausanne
toc: true
toc-location: right
format:
  html:
    number-sections: true
    html-math-method: katex
    self-contained: true
    code-overflow: wrap
    code-fold: true
    code-tools: true
    include-in-header: # add custom css to make the text in the `</> Code` dropdown black
      text: |
        <style type="text/css">
        .quarto-title-banner a {
          color: #000000;
        }
        </style>
  pdf: # use this if you want to render pdfs instead
    include-in-header: # wrapping the code also in the pdf (otherwise, it overflows)
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{
          commandchars=\\\{\},
          breaklines, breaknonspaceingroup, breakanywhere
        }
abstract: |
  This is the abstract of the report. It should be a short summary of the project, the data, the analysis and the results. It should be concise and to the point. It should not be longer than 250 words.
---

```{r}
#| label: setup
#| echo: false
#| message: false

# loading all the necessary packages
source(here::here("src", "setup.R"))
```

::: {.callout-tip}
### How to include sections separately

- You can use `{include X}` to include different sections of your report as separate `.qmd` files. This is also well documented in the Quarto documentation: <https://quarto.org/docs/authoring/includes>
- As mentioned in the documentation, we have used (_) prefix for the included files (e.g., `_introduction.qmd` and `_data.qmd`). You should always use an underscore prefix with included files so that they are automatically ignored (i.e. not treated as standalone files) by a quarto render of a project (not absolutely necessary in your case, but highly recommended).
- Rendering only `report.qmd` will render also all the other files.
:::

{{< include sections/_introduction.qmd >}}

{{< include sections/_data.qmd >}}

{{< include sections/_eda.qmd >}}

{{< include sections/_analysis.qmd >}}

{{< include sections/_conclusion.qmd >}}